Add batch inferencing support for GPT2LMHeadModel #7552

patrickvonplaten merged 3 commits into huggingface:master
Conversation
This enables significantly faster generation.

```python
# following above code
data = sentences * 128  # total 256 sentences
model.cuda();
data = [' '.join([x] * 10) for x in data]  # make the prompt longer to be more realistic

from tqdm.auto import tqdm

def test(batchsize=1, max_gen_len=20):
    for i in tqdm(range(0, len(data), batchsize)):
        batch = data[i: i + batchsize]
        inputs = tokenizer(batch, return_tensors="pt", padding=True)
        output_sequences = model.generate(
            input_ids=inputs['input_ids'].to(model.device),
            attention_mask=inputs['attention_mask'].to(model.device),
            do_sample=False,  # disable sampling to test if batching affects output
            pad_token_id=tokenizer.eos_token_id,
            max_length=len(inputs['input_ids'][0]) + max_gen_len,  # let it generate longer
        )
        outputs = [tokenizer.decode(x) for x in output_sequences]

%time test(1, 20)
%time test(32, 20)
%time test(1, 100)
%time test(32, 100)
```
Hey @cccntu - this is a great addition! I very much like your approach here. With the current implementation, the user would not be able to define his own `position_ids`.

@LysandreJik - this feature was heavily requested by the community (linked a couple of issues below) and I think this is a great way to handle GPT2 batch generation. What do you think?
@cccntu - Great work on this PR! If this PR is merged and you want to help the community a tiny bit more, you could give a short description (similar to what you've done above) on how to do batch generation with GPT2 here: https://discuss.huggingface.co/t/batch-generation-with-gpt2/1517. Many people have been asking for this so they would be very glad to see a short forum post about it. Thanks a lot again!
```python
position_ids = kwargs.get("position_ids", None)

if attention_mask is not None and position_ids is None:
    # create position_ids on the fly for batch generation
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 1)
    if past:
        position_ids = position_ids[:, -1].unsqueeze(-1)
else:
    position_ids = None
```
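To make the effect of this snippet concrete, here is a small self-contained sketch (the mask values are made up for illustration): for a left-padded batch, the cumulative sum of the attention mask gives each real token its correct absolute position starting at 0, while padded slots just get a dummy value of 1 (they are masked out anyway).

```python
import torch

# Hypothetical left-padded batch: 0 = padding, 1 = real token.
attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])

# Same computation as in prepare_inputs_for_generation above.
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)

print(position_ids)
# tensor([[1, 1, 0, 1, 2],
#         [0, 1, 2, 3, 4]])
```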
@patrickvonplaten
Now that you add `position_ids = kwargs.get("position_ids", None)`, I think we can get rid of the `else: position_ids = None` branch.
Also, inspired by the related PR #7355, I think we should move all the `if past` handling together, just above the `return`. Should I add another commit?
No strong opinions on this, will let @patrickvonplaten decide to merge with or without this
@cccntu - yeah, I thought about this as well. The problem with this, with PR #7355, and with passing position_ids in general is that we would have to incrementally append new positions to position_ids inside generate(), which would be pretty hacky since not all models support position_ids => so I'd rather not do this before a bigger refactor of generate(), see #6949 (will continue on the bigger refactor soon).
We can always change that later without breaking backwards compatibility.
LysandreJik left a comment
This is great, very simple implementation! Thanks a lot @cccntu.
Awesome, great work @cccntu! It would be amazing if you could write a little description of how your PR works on the forum: https://discuss.huggingface.co/t/batch-generation-with-gpt2/1517 - the community would be very thankful I think :-)
@patrickvonplaten Thanks for the suggestions! I just added some description to the forum post. 😃 Link to the post for future reference: https://discuss.huggingface.co/t/batch-generation-with-gpt2/1517/2
Can you please add batch inferencing for GPT2DoubleHeadsModel too?
I can see how batch generation is now available. I was wondering if there's already a way to do the same but with different arguments of `generate()` for each input.
Hi @spate141, did you mean passing a different set of arguments for each input? Actually, the main issue is that we need the right-most logits to not be padding, and without modifying generation_utils.py we need to use left-padding, and consequently we need this PR to make sure the positional embeddings are correct.
You can also check out the discussion in #3021, or the forum post: https://discuss.huggingface.co/t/batch-generation-with-gpt2/1517/3
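To illustrate the left-padding point, here is a minimal sketch (the prompts are just placeholders) showing why the default right padding breaks generation while left padding keeps the right-most token of every row a real token:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token

batch = ["Hello, my dog is a little", "Today, I"]

# Right padding: the shorter prompt ends in pad tokens, but generate()
# reads the next-token logits at the last position of each row.
tokenizer.padding_side = "right"
right = tokenizer(batch, return_tensors="pt", padding=True)
print(right["input_ids"][1, -1].item() == tokenizer.eos_token_id)  # True -> conditioning on padding

# Left padding: every row's right-most token is a real token, and this PR
# fixes the position ids so the absolute positions stay correct.
tokenizer.padding_side = "left"
left = tokenizer(batch, return_tensors="pt", padding=True)
print(left["input_ids"][1, -1].item() == tokenizer.eos_token_id)  # False
```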
I saw the code and I can see why it will fail. #3021 seems informative, I'll take a look. Meanwhile I found a few ways to get what I mentioned.

@cccntu In your 2nd comment to this pull request, you posted some impressive results on why doing batch generation is ideal, especially, let's say, when you have a GPU. I'm just trying to figure out if doing the same in my case is worth the latency when I have to do some post-processing. I'll post some latency results once I have this setup ready.
Update: @cccntu I went with my first approach, where I'm generating text for all texts in a single batch with global min/max values. In most cases the last text chunk in a batch is smaller, meaning its min/max values are smaller than those of the rest of the text chunks in the same batch; I'm just trimming the extra tokens. Results are impressive so far. Some numbers in case someone stumbles upon this thread in the future:

Fixed-size text batches:

Variable-size text batches:

Overall, batch text generation seems very useful (🎉), even though one has to add some overhead on top to manage some use cases.
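For reference, a rough sketch of that strategy (the function name and the per-input length bookkeeping here are illustrative, not the exact code used above): generate once per batch with the batch-wide maximum length, then trim each row back to its own budget.

```python
import torch

def generate_and_trim(model, tokenizer, prompts, max_new_tokens_per_prompt):
    """Batch-generate with the global max length, then trim each row to its own budget."""
    tokenizer.padding_side = "left"
    tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

    prompt_len = inputs["input_ids"].shape[1]
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            pad_token_id=tokenizer.eos_token_id,
            max_length=prompt_len + max(max_new_tokens_per_prompt),
        )

    # Keep each prompt plus at most its own number of newly generated tokens.
    return [
        tokenizer.decode(row[: prompt_len + budget], skip_special_tokens=True)
        for row, budget in zip(outputs, max_new_tokens_per_prompt)
    ]
```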
@cccntu Thanks for your great work! I stumbled upon this thread and would like to know:
Thanks for the code! I wonder if I could now generate sentences in a batch with other models (BertGeneration, for instance)? Looking forward to your reply!
@cccntu Thanks for your code. By using the correct position_ids in this case, we can do batch inference with the PyTorch model now. But when we export the GPT-2 model to ONNX with

```python
onnx_config = GPT2OnnxConfig(model.config)
# or using past_key_values mode
# onnx_config = GPT2OnnxConfig(model.config, use_past=True)
```

the ONNX model inputs don't contain position_ids but only input_ids and attention_mask.
Thank you for the code. I wonder if you have tested whether there is a performance drop when using batch generation, especially when the GPT-2 model is finetuned with right-padded data.


What does this PR do?
This adds the correct (absolute) positional embeddings when an attention mask is given; the position ids are computed from the attention mask.
Fixes #3021
Here is an example usage:
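A minimal sketch of such usage, assuming a stock gpt2 checkpoint and two illustrative prompts (the key point is left padding plus passing the attention mask; `sentences` is also what the benchmark above builds on):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# GPT-2 has no pad token; reuse EOS and pad on the left so that the
# right-most position of every row is a real token.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

sentences = ["Hello, my dog is a little",
             "Today, I"]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    output_sequences = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        do_sample=False,                    # greedy, to compare against unbatched output
        pad_token_id=tokenizer.eos_token_id,
    )

for seq in output_sequences:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```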
outputs:
comment: maybe this can also be added to examples/text-generation/run_generation.py, but I don't know much about other models, and it (the code) would be weird if only GPT-2 supports batch inferencing.

albert, bert, GPT2, XLM: @LysandreJik
TextGeneration: @TevenLeScao
documentation: @sgugger
@patrickvonplaten